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Abstract 

Kernel density estimation (KDE) is a popular statistical technique for estimating the under- 
lying density distribution with minimal assumptions. Although they can be shown to achieve 
asymptotic estimation optimality for any input distribution, cross-validating for an optimal 
parameter requires significant computation dominated by kernel summations. In this paper 
we present an improvement to the dual-tree algorithm, the first practical kernel summation 
algorithm for general dimension. Our extension is based on the series-expansion for the Gaus- 
sian kernel used by fast Gauss transform. First, we derive two additional analytical machinery 
for extending the original algorithm to utilize a hierarchical data structure, demonstrating the 
first truly hierarchical fast Gauss transform. Second, we show how to integrate the series- 
expansion approximation within the dual-tree approach to compute kernel summations with a 
user-controllable relative error bound. We evaluate our algorithm on real-world datasets in the 
context of optimal bandwidth selection in kernel density estimation. Our results demonstrate 
that our new algorithm is the only one that guarantees a hard relative error bound and offers 
fast performance across a wide range of bandwidths evaluated in cross validation procedures. 

1 Introduction 

Kernel density estimation (KDE) is the most widely used and studied nonparametric density esti- 
mation method. The model is the reference dataset TZ itself, containing the reference points indexed 
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Algorithm 1 NaiveKDE(Q, 7^): A brute-force computation of KDE. 
for each qi d Q do 
G(<z„7^)^0 
for each rj e TZ do 

G{q,,n)^G{q,,n)+Kh{U-r,\\) 
Normahze each G{qi,TZ) 



by natural numbers. Assume a local kernel function Kh{-) centered upon each reference point, and 
its scale parameter h (the 'bandwidth'). The common choices for Kh{ ) include the spherical, Gaus- 
sian and Epanechnikov kernels. We are given the query dataset Q containing query points whose 
densities we want to predict. The density estimate at the z-th query point qi € Q is: 

where \\qi — rj\\ denotes the Euclidean distance between the j-th query point qi and the j-th 
reference point r^, D the dimensionality of the data, \R\ the size of the reference dataset, and 

oo 

= / Kh{z)dz, a normalizing constant depending on D and h. With no assumptions on the 

— oo 

true underlying distribution, if ft, — !■ and \TZ\h — )■ oo and K{-) satisfy some mild conditions: 

J \ph{x) - p{x)\dx ^ (2) 

as \TZ\ — ?> oo with probability 1. As more data are observed, the estimate converges to the true 
density. In order to build our model for evaluating the densities at each qi g Q, we need to 
find the initially unknown asymptotically optimal bandwidth h* for the given reference dataset 
TZ. There are two main types of cross-validation methods for selecting the asymptotically optimal 
bandwidth. Cross-validation methods use the reference dataset TZ as the query dataset Q (i.e. 
Q = TZ). Likelihood cross-validation is derived by minimizing the KuUback-Leibler divergence 

/ p{x) log j^^dx, which yields the score: 

CVLKih) = ^ E ^°SPh,-jir,) (3) 

where the —j subscript denotes an estimate using all \7Z\ points except the j-th reference point. The 
bandwidth h^y^^ that maximizes CVLxih) is an asymptotically optimal bandwidth in likelihood 
cross validation sense. Least-squares cross-validation minimizes the integrated squared error 
/ {Ph{x) — pixyf' dx, yielding the score: 

CVLsih) = ^ E iP-^^''^^ - "^P-Ar,)) (4) 

where P*-j{-) is evaluated using the convolution kernel Kh{-) * Kh{-)- For the Gaussian kernel 
with bandwidth of h, the convolution kernel Kh{-) * Kh{ ) is the Gaussian kernel with bandwidth 
of 2h. Both cross validation scores require \TZ\ density estimate based on \TZ\ — 1 points, yielding 
a brute-force computational cost scaling quadratically (that is 0(|7?,p)) (see Algorithm [T]) . To 
make matters worse, nonparametric methods require a large number of reference points for conver- 
gence to the true underlying distribution and this has prevented many practitioners from applying 
nonparametric methods for function estimation. 
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1.1 Efficient Computation of Gaussian Kernel Sums 



One of the most commonly used kernel function is the Gaussian kernel, Kii{\\qi — rj\\) = e , 
although it is not the asymptotically optimal kernel. In this paper we focus on evaluating the 
Gaussian sums efficiently for each qt E Q: 

which is proportional to p{qi) using the Gaussian kernel. This computationally expensive sum 
is evaluated many times when cross-validating for an asymptotically optimal bandwidth for the 
Gaussian kernel. Algorithms have been developed to approximate the Gaussian kernel sums at the 
expense of reduced precision. We consider the following two error bound criteria that measure the 
quality of the approximation with respect to the true value. 

Definition 1.1. (Bounding the absolute error) An approximation algorithm guarantees e 
absolute error hound, if for each exact value ^{qi,TZ), it computes an approximation ^{qi,TZ) such 
that |$((^^,7^)-$(g^,7^)| < e. 

Definition 1.2. (Bounding the relative error) An approximation algorithm guarantees e 
relative error bound, if for each exact value ^{qi,TZ), it computes an approximation $(xq,7?,) such 
that |$(q„7^) - <^>{q^,n)\ < emq„TZ)\. 

Bounding the relative error is much harder because the error bound is in terms of the initially 
unknown exact quantity. Many previous methods [TTJ [H] have focused on bounding the absolute 
error. Nevertheless, the relative error bound criterion is preferred to the absolute error bound 
criterion in statistical applications. Therefore, our experiment will evaluate the performance of the 
algorithms for achieving the user-specified relative error tolerance. Our new algorithm which builds 
upon [9l m 13 is the only one to guarantee both the absolute error and the relative error bound 
criterion for all density estimates. 

1.2 Previous Approaches 

There are three main approaches proposed for overcoming the computational barrier in evaluating 
the Gaussian kernel sums: 

1. to expand the kernel sum as a power series I19[ [13] using a grid or a flat-clustering. 

2. to express the kernel sum as a convolution sum by using the grid of field charges created from 
the dataset [T5] . 

3. to utilize an adaptive hierarchical structure to group data points based on proximity [HlinilZ]- 
Now we briefly describe the strengths and the weaknesses of these methods. 

The Fast Gauss Transform (FGT). FGT |TT] belongs to a family of methods called the Fast 
Multipole Methods (FMM). These family of methods come with rigorous error bound on the kernel 



sums. Unlike other FMM algorithms, FGT uses a grid structure (see Figure 1(a) ) whose maximum 



side length is restricted to be at most the bandwidth h used in cross-validation due to the error 
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(a) (b) 

Figure 1: (a) Grid structure used in fast Gauss transform and multidimensional fast Fourier trans- 
form, (b) Single-level Clustering structure used in improved fast Gauss transform. 



bound criterion. FGT has not been widely used in higher dimensional statistical contexts. First, 
the number of the terms in the power series expansion for the kernel sums grows exponentially 
with dimensionality D] this causes computational bottleneck in evaluating the series expansion or 
translating a series expansion from one center to another. Second, the grid structure is extremely 
inefficient in higher dimensions since the storage cost is exponential in D and many of the boxes 
will be empty. 

The Improved Fast Gauss Transform (IFGT). IFGT is similar to FMM but utihzes a flat 



clustering to group data points (see Figure 1(b) ), which is more efficient than a grid structure used 
in FGT. The number of clusters k is chosen in advance. A partition of the data points into Ci, C2, 
• • • , Cfc is formed so that each reference point rj G 7?, is grouped according to its proximity to the 
set of representative points ci, C2, ■ ■ ■ , c^. That is, rj € Cm (whose representative point is Cm) if 
and only if ||rj — Cm|| < H^'j — q|| for 1 < / < fc. 

Furthermore, IFGT proposes using a different series expansion that does not require translation 
of expansion centers as done in FGT. The original algorithm required tweaking of multiple 
parameters which did not offer for a user to control the accuracy of the approximation. The latest 
version |13j is now fully automatic in choosing the approximation parameter for the absolute error 
bound, but is still inefficient except on large bandwidth parameters. We will discuss this further in 
Section g] 

Fast Fourier Transform (FFT). FFT is often quoted as the solution to the computational 
problem in evaluating the Gaussian kernel sums. Gaussian kernel summation using FFT is described 
in [M] and [H]. [M] discusses the implementation of KDE only in a univariate case, while [l8] 
extends |14j to handle more than one dimension. It uses a grid structure shown in Figure l(a)| by 
specifying the number of grid points along each dimension. 

The algorithm first computes the Mi x • • • x Mjj matrix by binning the data assigning the raw data 
to neighboring grid points using one of the binning rules. This involves computing the minimum and 
maximum coordinate values (gi,Mi , and the grid width Si — for each i-th dimension. 

This essentially divides each z-th dimension into Af, — 1 intervals of equal length. In particular, |18) 
discusses two different types of binning rules - linear binning, which is recommended by Silverman, 
and nearest-neighbor binning. 18 states that nearest- neighbor binning rule performs poorly, so 
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(a) Nearest Neighbor Binning Rule {A = 
1,S = C = D = 0) 
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(b) Linear Binning Rule (A = | , S = 



Figure 2: Two possible binning rules for KDE using multidimensional fast Fourier transform. 
Consider a data point falling in a two-dimensional rectangle. In 2(a) the entire weight is assigned 
to the nearest grid point. In 2(b)[ the weight is distributed to all neighboring grid points by linear 
interpolation. 



we will test the implementation using the linear binning rule, as recommended by both authors. In 
addition, we compute the Li x ■ ■ ■ x L]j kernel weight matrix, where Li — min ^ , — 1^, 

with T w 4 and T^i = J] e~^T^, -Lk < k < Lk, for I = [h, ...Jof e Z^. 

k=l 

To reduce the wrap-around effects of fast Fourier transform near the dataset boundary, we 
appropriately zero-pad the grid count and the kernel weight matrices to two matrices of the di- 
mensionality Pi X • • -Pd, where Pi = 2'°S2r^^^i+^il . The key ingredient in this method is the use 
of Convolution Theorem for Fourier transforms. The structure of the computed grid count ma- 
trix and the kernel weight matrix is crafted to take advantage of the fast Fourier transform. For 

Li Ljj 

every grid point g {gij^, gdj^), s'kigj) = ' ' ' J2 Cj-iK^j can be computed using 

/i — — Li I D — ~ L £) 

the Convolution Theorem for Fourier Transform. After taking the convolution of the grid count 
matrix and the kernel weight matrix, the Mi x • • • x sub-matrix in the upper left corner of 
the resultant matrix contains the kernel density estimate of the grid points. The density estimate 
of each query point is then linearly interpolated using the density estimates of neighboring grid 
points inside the cell it falls into. However, performing a calculation on equally-spaced grid points 
introduces artifacts at the boundaries of the data. The linear interpolation of the data points by 
assigning to neighboring grid points introduce further errors. Increasing the number of grid points 
to use along each dimension can provide more accuracy but also require more space to store the 
grid. Moreover, it is impossible to directly quantify incurred error on each estimate in terms of the 
number of grid points. 

Dual-tree KDE. In terms of discrete algorithmic structure, the dual-tree framework of [8] gener- 
alizes all of the well-known kernel summation algorithms. These include the Barnes-Hut algorithm 
[2], the Fast Muhipole Method [10], Appel's algorithm J^, and the WSPD [5]: the dual-tree method 
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The grid count matrix: — 

KliO 



( Ci,i • • • Cx,M^ ^ 
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The kernel weight matrix: K — 
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Figure 3: The grid count and the kernel weight matrix formed for a two-dimensional dataset. They 
are formed by appropriately zero-padding for taking the boundary-effects of fast Fourier transform 
based algorithms into account. 



is a node-node algorithm (considers query regions rather than points), is fully recursive, can use 
distribution-sensitive data structures such as fcd-trees, and is bichromatic (can specialize for differ- 
ing query set Q and reference set TZ) . It was applied to the problem of kernel density estimation in 
[9] using a simple variant of a centroid approximation used in [1] . 

This algorithm is currently the fastest Gaussian kernel summation algorithm for general di- 
mensions. Unfortunately, when performing cross-validation to determine the (initially unknown) 
optimal bandwidth, both sub-optimally small and large bandwidths must be evaluated. Section 2] 
demonstrates that the dual-tree method tends to be efficient at the optimal bandwidth and at 
bandwidths below the optimal bandwidth and at very large bandwidths. However, its performance 
degrades for intermediately large bandwidths. 



1.3 Our Contribution 

In this paper we present an improvement to the dual-tree algorithm [nHSlH' &rst practical kernel 
summation algorithm for general dimension. Our extension is based on the series-expansion for the 
Gaussian kernel used by fast Gauss transform [TI]. First, we derive two additional analytical 
machinery for extending the original algorithm to utilize a adaptive hierarchical data structure 
called kd-trees j4], demonstrating the first truly hierarchical fast Gauss transform, which we call 
the Dual-tree Fast Gauss Transform (DFGT) . Second, we show how to integrate the series-expansion 
approximation within the dual-tree approach to compute kernel summations with a user-controllable 
relative error bound. We evaluate our algorithm on real-world datasets in the context of optimal 
bandwidth selection in kernel density estimation. Our results demonstrate that our new algorithm 
is the only one that guarantees a relative error bound and offers fast performance across a wide 
range of bandwidths evaluated in cross validation procedures. 



6 



1.4 Structure of This Paper 

This paper builds on [12] where the Dual- Tree Fast Gauss Transform was presented briefly. It 
adds details on the approximation mechanisms used in the algorithm and provides a more thorough 
comparison with the other algorithms. In Section[2l we introduce a general computational strategy 
for efhciently computing the Gaussian kernel sums. In Section [31 we describe our extensions to 
the dual-tree algorithm to handle higher-order series expansion approximations. In Section |4l we 
provide performance comparison with some of the existing methods for evaluating the Gaussian 
kernel sums. 

1.5 Notations 

The general notation conventions used throughout this paper are as follows. Q denotes the set of 
query points for which we want to make the density computations. TZ denotes the set of reference 
points which are used to construct the kernel density estimation model. Query points and reference 
points are indexed by natural numbers i,j € N and denoted qi and rj respectively. For any set S, 
\S\ denotes the number of elements in S. For any vector v S and 1 < z < D, let v[i] denote the 
i-th component of v. 

2 Computational Technique 

We first introduce a hierarchical method for for organizing the data points for computation, and 
describe the generalized N-body approach [S] El El that enables the efficient computation of kernel 
sums using a tree. 

2.1 Spatial Trees 

A spatial tree is a hierarchical data structure that allows summarization and access of the dataset 
at different resolutions. The recursive nature of hierarchical data structures enables efficient com- 
putations that arc not possible with single-level data structures such as grids and flat clusterings. 
A hierarchical data structure satisfies the following properties: 

1. There is one root node representing the entire dataset. 

2. Each leaf node is a terminal node. 

3. Each internal node N points to two child nodes and such that n iV^ = and 

N^UN^ ^ N. 

Since a node can be viewed as a collection of points, each term will be used interchangeably with 
the other. A reference node is a collection of reference points and a query node is a collection of 
query points. We use a variant of kd-tiees [1] to form hierarchical groupings of points based on 
their locations using the recursive procedure shown in Algorithm [2] In this procedure, the set of 
points in each node defines a bounding hyper-rectangle [A^.6[1].Z, A^.6[l].u] x [A^.6[2].^, A^.6[2].u] x 
• • • X [N.b[D].l, N.b[D].u] whose i-th coordinates for 1 < i < D are defined by: N.b[i].l — min x[i] 

x^N.V 

and N.b[i].u — min x[i] where N.V is the set of points owned by the node A^. We also define the 

xGN.P 
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Algorithm 2 BuildKdTree(7'): Builds a mid-point A;d-tree from V . 

N ^ empty node, N.V ^V, ^ 0, N^' ^ 

for each c? e [\-;D] do 

N.b[d].l ^ mina;[d], N.b[d].u ^ max2;[d] 

x^P x£P 

if \V\ is above leaf threshold then 
N.sd i- arg max N.b\d\.u ~ N .b\(i\.l 

l<d<D 

^ ^ ^ N.blN.sd].l+N.b[N.sd].u 

^ {x € VlxlN.sd] < N.sc}, V'^ ^ {x e V\x[N.sd] > N.sc} 
^ BuildKdTreeCP^), N^' ^ BuildKdTreeCP^) 

return N 



geometric center of each node, which is 



N.c- 



N.h[l].l + N.h[l\.u N.h[2].l + N.b[2].u N .b[D].l ^ N .h[D\.u 
2 ' 2 '" ' ' 2 



The node N is split along the widest dimension of the bounding hyper-rectangle N.sd into two 
equal halves at the splitting coordinate N.sc. The algorithm continues splitting until the number 
of points is below the leaf threshold. Computing a bounding hyper- rectangle requires OdT-"!) cost. 

2.2 Generalized A^-body Approach 

Recall that the computational task involved in KDE is defined as: Vq^ £ Q, compute G{qi, TZ) = 
e . The general framework for computing a summation of this form is formalized 

in \W,^,"r . This approach forms A;c?-trees for both the query and reference data and then perform a 
dual-tree traversal over pairs of nodes, demonstrated in Figure|4]and Algorithm^ This procedure is 
called with Q and R as the root nodes of the query and the reference tree respectively. This allows 
us to compare chunks of the query and reference data, using the bounding boxes and additional 
information stored by the kd-tree to compute bounds on distances as shown in Figure ID These 
distance bounds can be computed in 0{D) time using: 



\ fc=i 



+ (6) 



^(max{d-'[fc],<-'[fc]}) (7) 



\ fc=i 



where ^^[fc] = - g"[fc], d'^][k] = Q'[k] - i?"[fc], d;;/[fc] = i?"[fc] - Q'[fc], 

d^j[k] = — The CanSummarize function tests whether it is possible to summarize 

the sum contribution of the given reference node for each query point in the given query node. If 
possible, the Summarize function approximates the sum contribution of the given reference node; 
we then say the given pair of the query node and the reference node has been pruned. The idea is 
to prune unneeded portions of the dual-tree traversal, thereby minimizing the number of exhaustive 
leaf-leaf computations. 
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Query tree Reference tree 




Q R Q R 

Figure 4: Top: A kd-tree partitions 2-dimensional points. Each node in the kd-tree records the 
bounding box for the subset of the dataset it contains (highhghted in color) . In dual-tree recursion, 
a pair of nodes chosen from the query tree and the reference tree is considered at a time. Bottom: 
the lower and upper bound on pairwise distances between the points contained in each of the 
query /reference node pair. 

3 Dual-Tree Fast Gauss Transform 
3.1 Mathematical Preliminaries 

Univariate Taylor's Theorem. The univariate Taylor's theorem is crucial for the approximation 
mechanism in Fast Gauss transform and our new algorithm: 

Theorem 3.1. Ifn > is an integer and f is a function which is n times continuously differ entiahle 
on the closed interval [c, x\ and n + 1 times differentiahle on (c, x) then 

f{x) = Y.f^Hc)^^^ + R^ (8) 

where the Lagrange form of the remainder term is given by 
Rn = for some ^ € (c,x). 

Multi-index Notation. Throughout this paper, we will be using the multi-index notation. A 
D-dimensional multi-index a is a ZJ-tuple of non-negative integers. For any _D-dimensional multi- 
indices a, (3 and any x € MP ^ 

• \a\ = q[1] +q[2] H hQp] 

. a! = (a[l])!(a[2])!-- - 

• = (a;[l])°W(s[2])°'21 ■ • ■ 

. Q + /3 = (a[l]+/3[l],-- - ,a[D]+p[D]) 
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Algorithm 3 DualTree((5, _R): The dual-tree main routine. 



if CanSummarize((5, _R, e) then 

Summarize((5, R) 
else 

if Q is a leaf node then 
if i? is a leaf node then 

DualTreeBase(Q, R) 
else 

DUALTREE(g,i?^), 

else 

if i? is a leaf node then 

DuALTREE(g^,i?), 
else 

DUALTREE(0^,i?^), 
DUALTREE(Q^,i?^), 



DUALTREE(g, R^) 



DUALTREE(g^, R) 

DUALTREE(g^,i?^) 
DUALTREE(g'", i?^) 



• a-^ = (a[l] - ,a[D]-/3[D]) for a > /3. 

where di is a i-th directional partial derivative. Define a > P if a[d] > /3[d], and a > p for 
p e Z+ U {0} if > p for 1 < d < D (and similarly for a < p). 

Properties of the Gaussian Kernel. Based on the univariate Taylor's Theorem stated above, [11] 
develops the series expansion mechanism for the Gaussian kernel sum. Our development begins with 
one-dimensional setting and generalizes to multi-dimensional setting. We first define the Hermite 
polynomials by the Rodrigues' formula: 

H4t) = {-!)" e'^D"e-'\teR^ (9) 

The first few polynomials include: -ffo(0 — ^i(^) — 2t, H2{t) = 4i^ — 2. The generating function 
for Hermite polynomials is defined by: 



JlzjHnit) (10) 



Let us define the Hermite functions hn{t) by 

h„{t) = e-''H„{t) (11) 

,2 

Multiplying both sides by e~ yields: 



s 

E -M^~^ (12) 



n=0 



We would like to use a "scaled and shifted" version of this derivation for taking the bandwidth h 
into account. 
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Note that our Z?-dimensional multivariate Gaussian kernel can be expressed as a product of D 
one-dimensional Gaussian kernel. Similarly, the multidimensional Hermite functions can be written 
as a product of one-dimensional Hermite functions using the following identity for any t S K*^. 



where = (^[1] 



+ it[D])\ 



-(t[ll-s[l])^-(t[2]-s[21)-^ (t[Dl-a[Dl)^ 



-(t[ll-a[ll)^ -(t[21-a[21)^ 



-(t[n]-sln]y 



(14) 



(15) 



We can also express the multivariate Gaussian about another point so e as: 

2 D / oo 



j ^ J_ /£[d]_£0[d] ^ f t[d]-So[d] 



1 / s - So 



ha 



t - So 



The representation which is dual to Equation (I16p is given by 

2 D I ca 



e 2h2 = 



HE 

d=l \na=Q 



(-1)"" ^ /^to[d]-s(d)^ /^t[d]-to 



/2/!,2 



= E 



/3>0 



(-1)", /^C 



/2h? ) \V2h?) 



The final property is the recurrence relation of the one-dimensional Hermite function: 

/i„+i(t) = 2t ■ h„{t) - 2n ■ h„-i{t),te 
and the Taylor expansion of the Hermite function ha{t) about to e M^. 



I3>0 



/3! 



(16) 



(17) 



(18) 



(19) 



3.2 Notations in Algorithm Descriptions 

Here we summarize notations used throughout the descriptions and the pseudocodes for our algo- 
rithms. The foUowings are notations that are relevant to a query point qi e Q or a query node Q 
in the query tree. 

• TZs^-): The set of reference points rj^ e TZ whose pairwise interaction is computed exhaustively 
for a query point qi G Q or a query node Q. 

• TZ'y{ ): The set of reference points Vj^ e TZ whose contribution is pruned via centroid-based 
approximation for a given query point q^ e Q. 



11 



The foUowings are notations relevant to a query point G Q. 

• G{qi,R): The true initially unknown kernel sum for a query point qi contributed by the 
reference set i? C 7?., i.e. ^ ^hiWqi — rj^W). 

• G^{qi,TZ): A lower bound on G(gi, 7?.). 

• G'^{qi,TZ): An upper bound on G{qi,TZ). 

• G{qi,R): An approximation to G{qi,R) for i? C 72.. This obeys the additive property for a 
family of pairwise disjoint sets G i qi, [J Ri) — J2 G{qi,Ri). 

\ i=l J 1=1 
rn \ 

• G [qi, A refined notation of G \qi, IJ Rj to specify the type of approxima- 

tion used for each reference node Rj. 

Here we define some notations for representing postponed bound changes to G' (g;^ , TV) and [qi^^ , TV) 
for all qi^ e Q. 

• Q.A': Postponed lower bound changes on G^ {qi^,TZ) for a query node Q in the query tree 
and qi e Q. 

• Q.L: Postponed changes to G{qi^,TlT{qi^)) for qi^^ € Q. 

• Q.A": Postponed upper bound changes on G''{qi^,TZ) for a query node Q in the query tree 
and qi e Q. 

These postponed changes to the upper and lower bounds must be incorporated into each individual 
query qi^ belonging to the sub-tree under Q. 

Our series-expansion based algorithm uses four different approximation methods, i.e. A e 
{E,T{c,p),F{c,p),D{c,p)}. i!^ again denotes the exhaustive computation of ^ Kh{\\qi — rj^\\). 

T[c,p) denotes the translation of the order p—\ far-field moments of R to the local moments in the 
query node Q that owns qi about a representative centroid c inside Q. F{c,p) denotes the evaluation 
of the order p — 1 far-field expansion formed by the moments of R expanded about a representative 
point c inside R. D{c,p) denotes the p— 1th order direct accumulation of the local moments due to 
R about a representative centroid c inside Q that owns qi . We discuss these approximation methods 
in Section 1331 

3.3 Series Expansion for the Gaussian Kernel Sums 

We would like to point out to our readers that we present the series expansion in a way that sheds 
light to a working implementation. |11) chose a theorem-proof format for explaining the essential 
operations. We present the series expansion methods from the more informed computer science 
perspective of divide-and-conquer and data structures, where the discrete aspects of the methods 
are concerned. 

One can derive the series expansion for the Gaussian kernel sums (defined in Equation ([5])) using 
Equation ([TS]) and Equation P7|) . The basic idea is to express the kernel sum contribution of a 
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reference node as a Taylor series of infinite terms and truncate it after some number of terms, given 
that the truncation error meets the desired absolute error tolerance. 

The followings are two main types of Taylor series representations for infinitely differentiable 
kernel functions Kh (O's- The key difference between two representations is the location of the 
expansion center which is either in a reference region or a query region. The center of the expansion 
for both types of expansions is conveniently chosen to be the geometric center of the region. For 
the node region N bounded by [N.b[l].l, N.b[l].u] x • • • x [N.b[D].l, N.b[D].u\, the center is N.c = 



N.b[l].l+N.b[l].u 
2 ' 



N.b[D].l+N.b[D].u 
2 



1 T 



1. Far- field expansion: A far-field expansion (derived from Equation (fT6)) ) expresses the kernel 
sum contribution from the reference points in the reference node R for an arbitrary query 
point. It is expanded about R.c, a representative point of R. Equation (|16|) is an infinite 
series, and thus we impose a truncation order p in each dimension. Substituting qi for t, rj 
for s and R.c for sq into Equation (|16p yields: 



G{q.,R)= J2 

D / oa 

= E n E 
= E nf E 



1 



a[d]\ 
1 



R.c[d] 



ri„ eRd=l 



y[d]<p 



E 

a [d] >p 



a[d]\ 



V2h? 
r,„ [d] - R.c[d] 

j„ [d] - R.c[d 



/2h? 




Truncating after p terms along each dimension yields: 
G{q.,R)^G{q^,{{R,F[R.c,p))}) 

= E nfE 



»[d]<p 



-E Eif^ 



-E 



E - 



- ^ Ma{R,R.c)ha 




where we denote 



Mc(i?,c)= 



(20) 



which is a function of a reference node R and an expansion center c. We denote G{qi, {R, F{c,p)}) 
as the far-field expansion of order p — I for the kernel sum contribution of R ex- 
panded about c. Ideally, we would like to choose the smallest p such that the truncation 
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Figure 5: Given the query node Q containing the query points {(Zimlmli and the reference node 
R containing the reference points evaluating the far-field expansion generated by the 

reference points at the given query point up to four terms in each dimension, G{qi^^ , R) ~ 

1 f rj„-R.c^ 



Gfe„,{(i?,F(i?.c,4))}) - E 

a<4 



^ q! 



ha ( '''^2/1^ ' computing the sum 



of the element-wise product between the two-dimensional array of far-field coefficients with the 
query-dependent two-dimensional array. 
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R 




Figure 6: The Gaussian kernel sum series expansion represented by the far-field coefficients in R, 
J2 Ma{R, R-c)ha [~^^)^ valid regardless of the location of the given query point, given the 

size constraint on the reference node (see Section l3.5p . However, each query point location will 
incur different amount of error. 



after the chosen order p incurs tolerable error; this will be discussed in Section [531 Note that 
the far-field expansion for the Gaussian kernel separates the interaction between a reference 
point and a query point (namely e""'''"''^"" Z*-^'' •*) into a summation of two product terms. 
For each multi- index a, Ma{R, R.c), which depends only on the intrinsic information for the 
reference node (the reference points r^;^ g R and the reference centroid R.c which is con- 
stant with respect to R), is called the far-field moments/coefficients of the reference region R. 
Because Ma{R, R.c) part of the far-field expansion of the Gaussian kernel sums is the same 
regardless of the query point qi used for evaluation, they can be computed only once and 
stored within R for efficiently approximating the contribution of R for different query points 
(see Figured]). Precomputing the far- field moments for a reference node R up to terms 

''^ ) for each a < p) requires 0{\R\p^) operations. 

The far-field expansion of order p — 1 for the Gaussian kernel sums is valid for any query 
locations qi given that the reference node meets the certain size constraint (see Section [575)1 . 
However, for a fixed order p, evaluating on query points that are far away from the reference 
centroid in general incur smaller amount of error. 

2. Local expansion: A local expansion (derived from Equation ()17|) ) is a Taylor expansion of 
the kernel sums about a representative point Q.c in a query region Q. After substituting qi 
for t, Q.c for and rj^ for s, the kernel sum contribution of all reference points in a reference 
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region i? to a query point qi € Q is given by: 



^ -ii'ii--j„ii- 
G(g„J?)= e 



D oo 



= E n E 

D 



-1)' 



rid'- 



4d]--rjjd]\ fq^[d]-Q.c[d] 



'2h? 



/2h? 



Again, truncating after p terms along each dimension yields: 
G{q^,{{R,D{Q.c,p))}) 



ERE 

Tj^^ £R rf— 1 <p 



(-1)"% 



rid'- 



/2h? 



qi[d] - Q.c[i 



=E 

0<p 



E 



/3 



/3! 



/2F 



qi - Q-c 
•J2h? 
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where we denote: 



I , otherwise 



(21) 



-D((5.c,p))})}^ are the direct local moments of R for Q. The error bound criterion 
will be discussed in Section [3?5] Note that: 



G[q.,[j{{R.,D{p^))} = ^G(g.,{(E„,i?(p.))}) 



-EE 

a ;3<p„ 

= E 

/3 <max Pa 



E 



/2h? 



Y,LMiRa,D{Q.C,Pa))}) 



Qz ~ Q-C 

qi - Q-c 
V2f^ 



= E Lp[[j{{Ra,D{Q.c,pa))} 

/3 <max Pa 
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Figure 7: Given the query node Q containing the query points {Qi^lmli and the reference node 
R containing the reference points {?'j„}„J:i, evaluating the local expansion generated by the ref- 
erence points at the given query point up to third terms in each dimension, G{qi^ , R) ~ 



G(q,,„,{(i?,i^(Q.c,3))}) = E 

0<3 



E ^^-^^P i^TW^) (^"vlP^) ' involves taking the dot- 



product between the two-dimensional array of local coefficients with the query-dependent two- 
dimensional array. 



In other words, the local moments for a fixed query node Q are additive (see Figure |8]) across 
a set of disjoint portions of the reference dataset TZ since its basis functions 

remain the same for all reference points regardless of their locations. For a given reference 
node R, accumulating the local moments of R up to terms (that is, evaluating for each 
P < p) requires 0{\R\p^) operations. These local coefficients are accumulated and stored 
within the given query node. The local expansion represented by the local coefficients is valid 
for all query points within the query node under certain constraints. 




3.4 Gaussian Sum Approximation Using Series Expansion 

Now again assume we are given a query node Q and a reference node R. Here we describe three 
main methods that use the two expansion types for approximating Gaussian summation, G{q,R)., 
for each q £ Q. 
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Figure 8: Accumulating direct local moments from three reference nodes with the 
nodes R2, and -R3 contributing nine terms, four terms, and one term respec- 

tively to form the local moments containing the contribution from R2, and R3: 

L{{{Ri,D{Q.c,3)),{R2,D{Q.c,2)),{R3,D{Q.c,l))}). Zeros denote the positions that are not ex- 
plicitly computed using Equation 1^. L{{{Ri, D{Q.c,3)), {R2, D{Q.c,2)), {R3, D{Q.c,l))}) = 
L{{{Ri,D{Q.c, 3))}) + L{{{R2,D{Q.c, 2))}) + L({(i?3, D(Q.c, 1))}) is added to the total local mo- 
ments for Q. 



1. Evaluating a far-field expansion of R: Given the pre-computed far-field moments Ma(R) 
up to terms, one could evaluate the far-field expansion for a given query point q (that is, 
approximate G{q, R)) by forming a dot-product between the query-dependent vector and the 
far-field moments, as shown in Figure [5] and Figure [6l Approximating G{q,R) for all g € Q 
requires 0{\Q\p^) operations since evaluating the far- field expansion each time requires 0{p^) 
operations. 

2. Computing and evaluating a local expansion inside Q due to the contribution 

of R: one could iterate over each reference point r^,^ G R and compute the local moments 
Lp{{{R, D{Q .c, p))}) due to R up to p^ terms, as shown in Figure[7]and FigurelH The local 
accumulation of the contribution of the reference node R requires 0{\R\p^) operations, and 
evaluating the local expansion for each G Q requires a total of 0{\Q\p^) operations. 

3. Converting far-field moments of i? to a local expansion of Q: Suppose R has pre- 
computed far-field moments up to p^ terms. From the far-field moments, we can approximate 
the local moments of R but with some amount of error. This can be seen as a generalization of 
centroid-based approximation, jllj describes this method as one of the translation operators, 
called far-field to local translation operator, stated below: 

Lemma 3.2. Far-field to local (F2L) translation operator for Gaussian kernel (as 

presented in Lemma 2.2 in llll}jh Given a reference node R, a query node Q, and the truncated 
far-field expansion centered at a centroid R.c of R up to p^ terms: 
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Figure 9: Two-dimensional far-field coefficients truncated after the first two terms in each dimension 
can be converted into a set of local moments using Equation ([2^ . Computing T{Q.c, 2))}) 

involves summing up the element-wise product between the matrix (or tensor in higher dimensions) 
consisting of the far-field moments and the two-by-two window over the Hermite functions whose 
upper left multi-index is /3. This figure shows how to compute L(i i){{{R,T{Q.c,2))}). 



G{q^^,{{R,F{R.c,p))})^ E M^{R,R.c)K (^^^^). 

the Taylor expansion of the far-field expansion at the centroid Q.c in Q is given by G(qi^ , {(-R, F{R.c^p))}) 
E Lp{{{R,T{Q.c,p))}) i'-'-^Y where for q,,^ e Q, 

LMiR, T{Q.c,p))}) = J2 MAR, R.c)K+p f ^-"^-' ) (22) 



Proof. The proof consists of replacing the Hermite function portion of the expansion with its 
Taylor series: 

G{q,„,,{{R,F{R.c,p))}) = ^M<,(7?, R.c)h, 

^E-(-.^.E^l(^)(-#) 
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However, note G{qi^ , {{R, F{R.c,p))}) has an infinite number of terms, and must be truncated 
after terms. In other words, the local moments accumulated for Q are the coefficients 

for G(g.„,{(i?,r(Q.c,p))}) = ^ T(g.c,p))}) (^i-^) ^ as shown in Figure 1 

To compute {L^({(i?, T((5.c,p))})}^<p, we need to iterate over all of far-field moments 
{Ma_{R,R-c)\a<p for each T(Q.c,p))}). This operation runs in 0{Dp'^^) operations. 

In general, these approximations are valid only under certain conditions which depend on how the 
error bounds associated with these approximation methods are derived. Moreover, we have not 
discussed how to choose the method of approximation given a query and reference node pair, and 
how to determine the order of approximation, i.e. the number of terms required to achieve a given 
level of error. We discuss the details in Section [3?5l 



3.5 Truncation Error Bounds 

Because the far-field and the local expansions are truncated after taking terms, we incur an 
error in approximation. The original error bounds for the Gaussian kernel in were wrong and 
corrections were shown in [3 . Here we present the error bounds for (1) evaluating a truncated 
far-field expansion of a reference node for any query point q € (2) evaluating a truncated local 
expansion of Q due to the contribution of a reference node R for any query point € Q (3) 
evaluating a truncated local expansion formed from converting a truncated far-field expansion of a 
reference node R for any query point qi^ € Q. Note that these error bounds place restrictions on 
the size of the nodes in consideration: reference node, query node, or both. First we start with the 
truncation error bound for evaluating the far-field expansion formed for a given reference node. 

Lemma 3.3. Error bound for evaluating a truncated far-field expansion (as presented in 
JEi)' Suppose we are given a far-field expansion of a reference node R about its centroid R.c: 

G{q,{{R,F{R.c,p))})^ Z M^[R,R.c)k(^) where 

Ma{R, R.c) = X] ^ (^^1/2^^) ■ ^f^^hi ^ ^ satisfies \\rj^ — -R.c||co < fh for r < 1, then for 
any q G , 

|G(g,{(i?,n7?.c,p))})-G(g,i?)|<^^M_^V^V^ ' (23) 



Proof. We expand the far-field expansion as a product of one-dimensional Hermite functions, and 
utilize a bound on one-dimensional Hermite functions due to [l7]: ^|/irt(a;)| < -^e^~,n >0,x£ 



,(g[d],r,„[d],i?.c[d])= Yl 



1 / [d] - R-c[d] Y' I. f<l[d]- R-4d] 



VpA'l[d]:r,Jd],R.c[d])= — 



1 ( r.\d\- R.c\i. 



/in. 



D 



1^ 



q[d] - R.c[d\ 
y/2h? 



= n K.(g[d],r,„[d],ii.c[d]) +t;p,(<?[d],r,„[d],7?.c[d])) 
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We obtain for 1< < D: 



UpAl[<^:rjJd],R.c[d]) 



y:\-r 



2/12 



2^ 



"i=0 

(qld]-R-c[d]f 









/ q[d] - R.c[d]\ 


V2h^ 






\ V2h^ J 



- 1 - r 

7li=0 



1 



rii 



[d] - R.c[d] 



rh 



/2/i2 



/nil 



n^—p 
(,qld]-R.o\d]f 



/2h? 
. _}_ 

~- VP 



- R.c\d\ 

1 



Therefore, 



ll<!-r,-„l|- 



{q[d],rj^[d],R.c[d\)-e ^ 

d=l 



< 



q — R.c 



- E 



-||<i-rj„l|-' 

e Sfi^ 



''in 



□ 



Intuitively, this theorem implies that evaluating a truncated far-field expansion for a query point 
(regardless of its location) requires that the reference points used to form the expansion are within 
the bandwidth h in each dimension from the centroid R.c (i.e. the reference node has a maximum 

side length of 2K). 

The following gives the truncation bound for the local expansion formed inside a query node 
whose bound is within a hypercube of some side length. 

Lemma 3.4. Error bound for evaluating a truncated local expansion: Suppose we are given 
the local expansion about the centroid Q.c of the given query node Q accounting for the kernel sum 

contribution of the given reference node R: G{qi^^. {{R, D{Q.c,p))}) = ^ L^{{{R, D{Q.c,p))}) i 



where Qi^GQ and L0{Q,{{R,D{p))})= E ^^^^^ (^7^) 



2h 



If^Qim S Q satisfies \\qi^ — Q.c\\oo < rh for r < 1, then for any e Q: 

\G{qi^,{{R,D{Q.c,p))}) 



(24) 
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Proof. Taylor expansion of the Hermite function yields: 



1 fr. - R.c 



I3\ ^ a 

fi>0 a>0 



I3\ ^ Q! 

/3>0 Qf>0 



(-1) 



fi\ 



/3>0 



/2F 



-Q.c 



.c — R.C 



'.c — 7i.c\ / qi„ 



/27i2 



Use e = n (^p(9jm ['^]; ''in [^]; Q-c^]) + ^^p(9i™ M]j M]i Q-clc^])) for 1 < d < I?, where 



"d=C 
oo 



'2h? 



/2h? 



These univariate functions respectively satisfy Wp^ {qi^ [d\ , rj^ [d\ , (5.c[d] ) < and (q^^^ [d] , r^^ [d] , (5.c[i 
-^jz^, for 1 < < -D, achieving the multivariate bound. The proof is similar as in the one given 



in Lemma 157 



□ 

Lastly, we present the error bound for evaluating a truncated local expansion formed from a 
truncated far-field expansion, which requires that both the query node and the reference node are 
"small": 

Lemma 3.5. Error bound for evaluating a truncated local expansion converted from 
an already truncated far-field expansion: A truncated far-field expansion centered about the 
centroid R.c of a reference node R, 



G(q,{(i?,F(i?.c,p))}) = ^M„(i?,i?.c)/^„ 



q - R.c 
~7W 



< 



has the following local expansion about the centroid Q.c of a query node Q for qi^ G Q: G{qi^ , {(i?, F{R.c,p))}) 
E £^({(i?,T(g.c,p))}) (^^7^)^ where: T(g.c,p))}) = E M^{R,R.c)h^+p (^7^) 



0>O 



LetG{q,^,{{R,T{Q.c,p))))^ E Lp{{(R,T{Q.c,p)))) ( 

f3<p ^ 



, a 



truncation of the local expansion of G{qi^, {{R, F{R.c,p))}) after p^ terms. 

IfVqi^ G Q satisfies \\qi^ — <5.c||oo < rh andVrj^ G R satisfies \\rj^ — i?.c||oo < rh for r < ^, 
then for any qi^ G Q : 



Gfe™ , {{R, T{Q.c,p))}) ~ Giq,,„ , R)\ 



\R\ 



^(l_2r)20 



(25) 
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Proof. We define for 1 < d < I?: 



"pd = "^pili^ld], rj,,[d\, Q .c[d], R.c[d]) 
^Pd = Mlim [d],rj„ [d],Q.c[d\,R.c[d\) 
Wp^ = Wp{qi^ [d],rj,^ [d],Q.c[d\, R.c[d\) 



(-1)"' 1 ( R.c[d]-r,^[d] Y, , 
"■pd = 2^ —1— 2^ — ( 7^ ) (-1) ' 



"i=0 

p-i 



c[d] - _R.c[d] N( f [d] - Q.c[d] \ ' 



-J — ^ 

Q.c[d] - R.cld] \ / g,^ [d] - Q.c[d] \ "> 
V2h^ ) V V2h2 j 
~ (-1)"^ ~ 1 f B..c[d]'r,^[d] Y^ , ,sr. 



Plh? } 



=0 "J- 

Q.c\d\ - R.c\d\ \ ( \d\ - Q.c\d\ \ ' 



Plh? 



/2h? j 



Note that e = J| (up^ + ^^Pd +^'^Pd) for 1 < d < D. Using the bound for Hermite 

d=l 

functions and the property of geometric series, we obtain the fohowing upper bounds: 



Therefore, 



p— 1 p— 1 

—0 rij —0 

p-1 oo 

^ —0 Tij — p 

oo oo 



1 - (2r)^) 
1 - 2r 

1 /I - (2r)f \ / (2rf 



— p iij —0 



^;^E E(2'-r(2.r = :^ 



1 - 2r J \ l-2r 
1 \ (2r)'' 



V^Vl-2ry Vl-2r 



n «Pd - e ^ 

d=l 



G(9i,„ , T(Q.c, p))}) - G(g,„ , fl) 



□ 



[l6| proposes an interesting idea of using Stirling's formula (for any non-negative integer n, 
(■^^7-^) ^ "0 to lift the node size constraint. This could allow approximation of larger regions that 
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Algorithm 4 FarFieldOrder((5, i?, r): Determines the order of approximation needed for eval- 
uating a far-field expansion of the reference node R. 

R.b\d].u~R.b{d].l 

r -s— max — „, 

l<d<D 

if r > 1 then 
return oo 
else 

while p < pmax do 
p ^ p + I 

D-1 „ / „ \ D-k 



fe=0 

return p 
return oo 



possibly contain more points. Unfortunately, the error bounds derived in |16j were also incorrect. 
We have derived the necessary corrected error bounds based on the techniques in [3]. However, we 
do not include the derivations here since using these bounds actually degraded performance in our 
algorithm. 

3.6 Determining the Approximation Order 



Note that Lemma 13.31 Lemma 13. 4| and Lemma 13.51 answer the question of the following form: 
given that we use p^ terms in the appropriate expansion type, what is the upper bound on the 



approximation error. 



G{q,R) - G{q,R) 



? Nevertheless, all three lemmas can be re-phrased to 
answer the question in reverse: given the maximum user-desired absolute error, what is the order 
of approximation/number of terms required to achieve it? This question rises naturally within 
our dual-tree based algorithm that bounds the kernel sum approximation error on each part in a 
partition of the reference dataset TZ. 

Algorithm |4] shows how to determine the necessary order of the far- field expansion for the given 



reference node R such that 



G{q,R)-G{q,R) 



< T. That is, the approximation error due to the 
far-field expansion of R is bounded by the error allocated for approximating the contribution of 
the reference node R. Using far-field expansion based approximation requires a "small" reference 
node. Thus, the algorithm first computes the ratio of the maximum side length of R to twice the 
bandwidth h, and determines the least order required for achieving the maximum absolute error r 
by evaluating the right-hand side of Equation (I23p iteratively on different values of p. 

Algorithm [5] shows how to determine the necessary order of the local expansion formed by 
directly accumulating the contribution of the given reference node R onto the given query node Q. 
This approximation method requires the query node Q to have the maximum side length within 
twice the bandwidth. The algorithm determines the least order required for achieving the maximum 
absolute error r by evaluating the right-hand side of Equation (p4)) iteratively on different values 
of p. 

Finally, Algorithm [6] determines the necessary order of local expansion formed by converting 
a truncated far-field expansion of the given reference node R. In contrast to the two previous 
algorithms, this one requires both the query node Q and the reference node R to have a maximum 



24 



Algorithm 5 LocalAccumulationOrder(Q, i?, r): Determining the order of approximation 
needed for forming a local expansion of the contribution from the given reference node R for the 
query node Q. 

max OM dl-QM-iU 

l<d<_D 

if 7- > 1 then 
return oo 
else 

while p < pmax do 
p -s— p + 1 

if (T^ g - (S)"" < r then 

return p 
return oo 



Algorithm 6 ConvertFarFieldToLocalOrder((5, i?, r): Determining the order of approxi- 
mation needed for evaluating a far-field expansion of the given reference node R. 
max ^^^{ ^■''^.^-9.^^].^ H -^M-u-i^.M^q./ } 

if r > i then 
return oo 
else 

p^O 

v^rhile p < p,nax do 
p -s— p + 1 

if % (^)((1 - i^rrff (((HrnMrn)""'' < r then 

return p 
return oo 



side length less than the bandwidth h. After the node size requirements are satisfied, the least order 
required for achieving the maximum absolute error r is obtained by evaluating the right-hand side 
of Equation ([25]) iteratively on different values of p. 

3.7 Deriving the Hierarchical FGT 

Until now, we have discussed the approximation methods developed for a non-hierarchical version 
of fast Gauss transform described in In this section, we derive the two additional translation 
operators that extend the original fast Gauss transform to use a hierarchical data structure. Here we 
consider the reference tree, which enables the consideration of the different portions of the reference 
set 7?, at a different granularity. Given the computed far- field moments of i?^ and R^ , each centered 
at R^.c and R^.c, how can we efficiently compute the far-field moments of R centered at R.c, the 
parent of R^ and R^l The first operator allows the efficient bottom-up pre-computation of the 
Hermite moments in the reference tree. 
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Figure 10: The solid arrows mark the flow of contribution from the reference tree to the query tree 
in case of a prune via a far-field to local translation between the reference node R and query node 
Q. On the reference side, the far-field moments are formed in the bottom-up fashion; on the query 
side, the accumulated local moments will be propagated downwards during a post-processing step 
via local-to-local translations. 

Lemma 3.6. Shifting a far-field expansion of a reference node to a new center (F2F 
translation operator for the Gaussian kernel): Given the far-field expansion centered at R.c 
in a reference node R: 




this same far-field expansion shifted to a new location c' 



is given by: 



Giq, {{R, FiR.cp))}) = Giq, {{R, F{c',p))}) = ^ M-,{R, c')h. 




where 



(26) 



Proof. Replace the Hermite part of the expansion by a new Taylor series: 



G{q,{iR,F{R.c,p))}) 
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Figure 11: Given the far- field moments of i?^ and illustrated in the first two tables, Theorem l3.6 
can re-center each set of far-field moments of and R^ at centroid R.c. The re-centered far-field 
moments are shown in the third table with two numbers, each contributed by i?^ and R^ . The 
far-field moments of R are then computed by adding up the two re-centered moments entry-wise. 



/3! V %/2^ 



<-"'"'- (^) 



a<p /3>0 



1 fc'-R.c 



I3\ V V2h^ 



1 fR.c-c' 



131 V V2h^ 



=E 



0<a<7 



R.C - c' 

"TIP" 



where 7 = a -f /?. 



□ 



Using Lemma l3.61 we can compute the far-field moments of Q centered at Q.c by translating the 
moments {Af^(_R^, R^.c)}^^p and {My{R^, R^ .c)}y^p to form the moments {My{R^, R.c)}^^p and 
{M^iR'^, R.c)}^<p. Then, the far-field moments of i? = i?i U i?2 are {M^{R^,R.c) + M^{R^, R.c)} 
and 



G{q,{{R,F{R.c,p))}) = Y^{KLi{R^,R-c) + K'Li{R'^,R.c))h., 

7<P 



g — i?.c 



Computing each M^{R^,R.c) from M^{R^,R^.c) (and each M-^{R^, R.c) from M-f{R^, R^.c)) re- 
quires iterating over at most terms. This operation runs in 0{Dp^^), which can be more efficient 
than computing the far-field moments of R centered at R.c from scratch (which is 0{\R\Dp^)). 
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The next translation operator acts as a "clean-up" routine in a hierarchical algorithm. Since we 
can approximate at different scales in the query tree, we must somehow combine all the approxi- 
mations at the end of the computation. By performing a breadth-first traversal of the query tree, 
the L2L operator shifts a node's local expansion to the centroid of each child. 

Lemma 3.7. Shifting a combined local expansion of a query node to a new center (L2L 
translation operator for Gaussian kernel): Given a combined local expansion centered at Q.c 
of the given query node Q: 



G(g,7^I,((5)u7^r(Q)) 



^La(Q.c,7^I,(0)u7^r(0)) 

/3<P 



Q.C 



Shifting this local expansion to the new center c' £ Q yields: 
G{q,TZT,{Q)uTZT{Q)) 



/3>a 



!(/?- 



jL^iQ.c, TZT>iQ)uTZT{Q)) 



Q.c 



'2h? 



0-a 



'2h? 



where we denote 

L/3(c',7^I,(Q)U7^r(Q))= ^ 



/3>o 



q!(/3- 



-Lp{Q.c, 7^I,(Q)U7^r(Q)) 



Q.c 



/2h? 



(27) 



Proof. Use the multinomial theorem to expand about the new center c': 
G(g,7^I,(Q)u7^r(Q)) = ^ L,3(g.c, 7^D(g) u 7^r(g)) 

/3<P 

whose summation order can be interchanged to achieve the result. □ 

Using Lemma 13. 7( we can shift the local moments of Q centered at Q.c to a different expan- 
sion center, such as an expansion center of one of the child nodes of Q. Let p be the maximum 
approximation order used among the reference nodes pruned via far-to-local translation (TZf{Q)) 
and direct local accumulation {TZv{Q))- The local moment propagation to both child nodes of Q 
is achieved by the following operations: 

{L^(0^.c,7^I,(0^)U7^r(0''))}^)<p^{L;9(Q^.c,7^I,(Q^)u7^r(Q''))};9<P+ 

{L^(0^.c,7^I5(Q) U7^r(Q))}/3<p 

{i8(Q^.c, 7^I,((3^) U 7^r(0''))},3<p ^{L;3((3''.c,7^I,(Q^) u7^r(Q''))}/^<p+ 

{L^(0^.c,7li,(0) U7^r(Q))};9<p 

where the addition operation is an element-wise operation that combines the two scalars with the 
same multi-index position. 
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Figure 12: Given the local moments centered at Q.c, Theorem 13.71 can re-center them at the two 
centroids Q^.c and Q^.c. 



3.8 Choosing the Best Approximation Method 

Suppose we are given a query node Q and a reference node R pair during the invocation of Algo- 
rithm [131 CanSummarize function for the higher-order DFGT algorithm has four approximation 
methods available: A G {E,T{c,p), F{c,p), D{c,p)} (see Section . Because we would like to 
avoid exhaustive computations, the higher-order DFGT algorithm uses only three of the approxima- 
tion methods and defers exhaustive computations until query/reference leaf pairs are encountered. 
Algorithm [7] tests whether the given query node and reference node pair can be approximated by 
evaluating the far-field moments of R, computing direct local accumulation due to R, and trans- 
lating some of the terms that constitute the far-field moments of R (far-field-to-local translation 
operator) and evaluates the asymptotic cost of each approximation. Algorithm [7] then determines 
the approximation method with the lowest asymptotic cost. This idea was originally introduced 
in [TT] in the description of the original fast Gauss transform algorithm. The key difference is that 
even if Algorithm [7] returns E (when none of the other approximation methods can beat the cost of 
the exhaustive method), our hierarchical algorithm will not default to exhaustive evaluations and 
will consider the query points and reference points at a finer granularity, as shown in Algorithm [T31 



3.9 Hierarchical FGT 

Given the analytical machinery developed in the previous section, we now describe how to extend 
the centroid-based dual-tree [71 [5] to do higher-order approximations. The main structure of the 
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Algorithm 7 ChooseBestMethod((5, _R, r): Chooses the FMM-type approximation with the 
least cost for a query and reference node pair. 

Pf ^ FarFieldOrder((5, i?, r) 

Pd ^ LocalAccumulationOrder((5, i?, r) 

PT ^ CONVERTFARFlELDToLOCALORDER((5,i?, r) 

CF ^ NqDP^ + \ CD ^ NrDP^ + \ Ct ^ D^P^ + \ CE ^ DNqNr 

if cf — niinjci?, cd, ct, ce} then 

return F{R.c,pf) 
else if CD = miii{cF , cd , ct , ce} then 

return D{Q.c,pd) 
else if Ct = minjci?, C£), ct, c^;} then 

return T{Q.c,pt) 
else 

return E 
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Figure 13: Four ways of approximating the contribution of a reference node to a query node. 
Top left: exhaustive computations (few reference points/few query points); Top right: far- 
field moment evaluating (many reference points/few query points); Bottom left: direct local 
moment accumulation (few reference points/many query points); Bottom right: far- field-to- local 
translation (many reference points/many query points). 



algorithm is shown in Algorithm |51 We provide only a high-level overview of our algorithm and 
defer the discussion on the implementation details to Appendix. 

Initialization of the query tree. Each query node maintains a vector storing {pmax)^ terms, 
where Pmax is a pre-determined limit on the approximation ordei0 depending on the dimensionality 
of the query set Q and the reference set TZ. For the experimental results, we have fixed Pmax = 6 
for D = 2, p,nax = 4 for L» = 3, p,nax = 2 for L) = 4 and D = 5, Pmax = 1 for £) > 6. 

Pre-computation of far-field moments. Before the main KDE computation can begin, we pre- 
compute the far-field moments of each reference node in the reference tree up to {pmax)^ terms. We 
show how to efficiently pre-compute the far-field moments of each reference node in the reference 
tree in Algorithm [TOl The algorithm uses Equation pO| for the leaf node and Equation (l26l) for 

^We impose this limit because the number of terms scales exponentially with the dimensionality D, O(p^). 
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Algorithm 8 DFGTMain(Q, 7^): The main KDE routine. 

QToot ^ BUILDKdTREE(Q), i?™"* 4- BUILDKDTREE(7^) 

DFGTlNiTQ(g™°'), DFGTlNiTR(i?™°*) 
DFGT(g™°S i?™"*), DFGTPoST(g™°') 



Algorithm 9 DFGTInitQ(Q): Initializes query bound summary statistics. 

{Initialize the node bound summary statistics.} 

G'(Q,7^)^0, G"(Q, 7^) ^ |7^|, g.A"^0 

{Initialize translated local moments to be a vector of length (p^oa;)-} 

<3-^o<i<(p„„.)^ ^ 

if Q is a leaf node then 

{Initialize for each query point.} 

for each q^,^ € Q do 

G(g,„ , 7^) ^ 0, G((7,„ , lis )) ^ 
Gfe„,7e^((7,„))^0, G(q,„,7^pfe,Ju7^rfe,J) ^0 

else 

DFGTInitQ(Q^), DFGTInitQ(Q^) 



translating the moments of the child nodes for the internal node case. We describe the implemen- 
tation details in Appendix. 

Determining the prunability of the given query and reference pair (shown in Algo- 
rithm [T^]). Note that the function Summarize includes calls to the following functions (see Ap- 
pendix): 

1. EvalFarFieldExpansion: evaluates the far-field moments stored in R at each query point 
in Q up to (pi?)^ terms. See Algorithm [23l 

2. AccumulateDirectLocalMoment: computes direct local moment contribution of R cen- 
tered at Q.c in Q. See Algorithm [25] 

3. TransFarToLocal: translates the far-field moments of R up to (pr)^ terms to the local 
moment centered Q.c in Q. See Algorithm [24l 

Dual-tree Recursion. Algorithm [T3] shows the basic structure of the dual-tree based KDE 
computation (see Figure S]). This procedure is first called with Q and R as the root nodes of the 
query and the reference tree respectively. CanSummarize takes three parameters: the current 
query node Q, the current reference node R, and the global relative error tolerance e. This function 
tests whether the the contribution of the given reference node for each query point in the given 
query node can be approximated within the error tolerance. If the approximation is not possible, 
then the algorithm continues to consider the query and the reference data at a finer granularity. 
The basic idea is to terminate the recursion as soon as possible by considering large "chunks" of the 
query data and the reference data and avoiding the number of exhaustive leaf-leaf computations. 
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Algorithm 10 DFGTInitR(_R): Pre-computes far-field moments. 



{Initialize the far- field moments of R to be empty.} 
for i = to i < {pmax)° do 

^PositionToMultiindex(j,p„„,.)('^' -^■"^) ^ *^ 
if i? is a leaf node then 

{Accumulate far-field moment from each point (Equation ([20])).} 

ACCUMULATEFARFlELDMOMENT(i?) 

else 

{Recursively compute the moments of the child nodes and combine them.} 
DFGTlNiTR(i?^), DFGTlNiTR(i?'R) 

TRANSFARToFAR(i?^, i?), TRANSFARToFAR(i?^, R) 



Algorithm 11 CanSummarize(Q, i?, e): Determines the prunability of the given query node Q 
and reference node R 

return ChooseBestMethod (q, R, M\£l:ZliQiEy\ ^ e 



We can achieve this if we utilize approximation schemes that yield high accuracy and have cheap 
computational costs. 

Each prune made for a pair of a query and a reference node is summarized in the given query 
node by incorporating the lower and the upper bound changes and 6^{Q,R) contributed 

by the reference node into Q.A' and Q.A". These two bound updates due to a prune can be 
regarded as a new piece of information which is known only locally to the given query node Q. All 
of the bounds in the entire subtree of Q should reflect this information. One way to achieve this 
effect is to pass the lower bound and the upper bound changes owned by Q (i.e., Q.A' and Q.A") 
to Q's immediate children, whenever the algorithm needs to consider the query dataset at a finer 
granularity by recursing to the left and the right child of Q. 

Base-case Computation. If a given leaf query and leaf reference node pair could not be pruned, 
then DFGTBase (shown in Algorithm [T4| is called. Because all kernel evaluations are computed 
exactly, we can refine the bound summary statistics of the given query node Q (that is, G^Q^TZ) 
and G'^{Q,TV)) further and hence we reset them to oo and —oo respectively. For each query point 
lim G Q, we first incorporate the postponed bound changes passed down from the ancestor node of 



Algorithm 12 Summarize((5, i?): Summarizes the contribution of R. 
{Add bound changes.} 

Q.A' ^ Q.A' + 5\Q, R), g.A" ^ Q.A" + <5"(Q, R) 
if A is of the form F{R.c,pf) then 

EVALFARFlELDEXPANSION(i?, Q,Pf) 

else if A is of the form D{Q.c,pd) then 

ACCUMULATEDlRECTLOCALMOMENT(i?, Q,Pd) 

else 

TRANSFARToLOCAL(i?, Q,Pt) 
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Algorithm 13 DFGT{Q, R): The core dual-tree routine for computing KDE. 



S\Q,R) = \R\Kh{d^{Q,R)), S^{Q,R) = \R\{Kh{d'{Q, R)) - 1) 
{Add postponed contributions/bound changes from the current pair.} 

(g, 7^) ^ G' {Q,n) + q. a' + s^{q, r) 
Qu,newi^Q^ 7e) ^ G^{Q,n) + g.A" + (5"(q, r) 

if CanSummarize((5, R, e) then 

Summarize((3, R) 
else 

if Q is a leaf node then 
if i? is a leaf node then 

DFGTBASE(g,i?) 
else 

DFGT(Q, i?^),DFGT(g, R^) 

else 

{Push down postponed bound changes owned by Q to the children.} 

g^.A' ^ g^.A' + g.A', g-^.A' ^ g^.A' + g.A' 
g^.A" ^ g^.A" + g.A", g^.A" ^ g^.A" + g.A" 
g.A' ^ 0, g.A" 4- 

if i? is a leaf node then 

DFGT(g^,i?), DFGT(g'",i?) 
else 

DFGT(g^, i?^),DFGT(g^, i?^),DFGT(g^, E^),DFGT(g^, R") 
{Refine the bounds based on the recursion results.} 

G^{Q,n) ^ min{G'(g^,7^) + g^.A^G"(g^,7^) + g^.A'} 

G'^{Q,'R.) ^ max{G''iQ^,n)+Q^.A'',G''{Q",n)+Q^.A''} 



Q. We loop over each reference point rj^ e R and compute the kernel value between qi^^ and rj^ and 
accumulate the lower bound G'' {qi^,Tl), the kernel sum computed exhaustively G{qi^,TZ£{qi^)), 
and the upper bound G^{qi^,TZ) 

Note that we subtract one for updating G" {qi^ , TZ) for correcting the prior assumption that 
^h{\\qi^ — ^j„||) = 1, while the lower bound G\qi^,TZ) and G{qi^,'Rs{qi^)) are incremented by 
Kh{\\qi^ — rj^W). As the contribution of the reference node R is added onto the query point gi„'s 
sum, we can refine the bound summary statistics owned by Q such that (g, 7?,) = min G' (g^^ , TV) 

and G'^{Q, TZ) — max G'^{qi^,TZ). Finally, we reset the postponed bound changes stored in Q to 
zero. 

Post-processing (shown in Algorithm [T5|) . For the non-leaf case, the local-to- local translation 
operator (TransLocalToLocal) is called to re-center the local moments at the current level and 
passes them down to the child nodes. For the leaf-case, EvalLocalExpansion is called to convert 
local moments to a single scalar that represents the contribution to a given query point. 
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Algorithm 14 DFGTBase((5, -R): Computes exact contribution of i? to Q. 

G'(Q,7^)^oo, G"(Q,7^) <(- -oo 
for each G Q do 

{Add postponed changes passed down from the ancestor node of Q.} 

for each r^;, G i? do 

vj^ Kh(U^ - r,J|)_, G\q,^,n) ^ G\q,^,n) + v 

G{qi^,n£{qiJ) ^ Giqi^,n£{qiJ)+v 

G«fe„,7^)^G"fe,„,7^) + («-!) 
{Refine the bound summary statistics owned by Q.} 
&{Q,n) ^ mm{G\Q,n),G\qi^,Tl)} 
G^{Q,TZ) ^ max{G«(Q,7e),G"fe„,7e)} 

g.A' ^ 0, g.A" ^ 



3.10 Basic Properties of DFGT Algorithms 

Theorem 3.8. Lower/upper hounds are maintained properly at all times for each q € Q and each 
query node Q during the function call DFGTMain. 

Proof. We show that the bounds are maintained properly for three main parts in the function 

DFGTMain: DFGTInitQ, DFGT, and DFGTPost. 

The function call DFGTInitQ: It is clear that for ah e Q, G'(gJ,7^) < G(g,,7^) < 
G"(ft,7^) = \n\. Furthermore, for each query node Q, = G\Q,n) < G(g,„,7^) < G''{Q,n) = 
\TZ\ for each qi^ e Q. 

The function call DFGTBase: Let Q and R be the query node and the reference node respectively. 
For each query point g,^ € Q, G\qi^, TZ) is incremented by Q-A' + ^ Kh{\\qi^ ~ ^jnll)! 

G"(gi„,7^) by O.A"+ 

^ {Kh{\\qi^ — rj^W) — 1); this operation incorporates the passed-down contribution for e Q, 

and un-docs the assumption made during the initialization phase of DFGTInitQ. G^{Q,TZ) and 
G'^{Q, TV) are updated to be the minimum among G\qi^ , TV) and the maximum among G^{qi^ , TZ) 
respectively. The postponed bound changes Q.A' and Q.A" are cleared to avoid double-counting 
when Q may be visited later. 

The function call DFGT: We induct on the number of points owned by the query node Q and the 

reference node R in consideration (i.e. |Q| + \R\). The only possible places that change G\qi^,TZ), 
G'^{qi^,n), G\Q,n) and G'^{Q,Ti) are the caU to the base case function DFGTBase and the last 
two lines of the function DFGT. The correctness of DFGTBase function is proven already, so we 
consider the second case. The two function calls DFGT{Q^,R) and DFGT(Q^, ii) (in case R is 
a leaf node) and the four function calls 

DFGT((5^, R^), DFGT(Q^, R^), DFGT(Q^, R^), and DFGT(Q^, R^) (in case R is an internal 
node) are smaller subproblems than (Q, R) pair. By the induction hypothesis, these calls maintain 
the lower and the upper bounds properly. The lower bound is set to the minimum of the "best" 
lower bound owned by the children of Q: min{G'(Q^, 11) + Q^.A', G'{Q^, TV) + Q«.A'}. Similarly, 
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Algorithm 15 DFGTPost((5): The post-processing routine, 
if Q is a leaf node then 

for each G Q do 

{Add bound changes for the query node at the given query point qt^.} 
G'fe„,7^) ^ G'((z,„,7^) + Q.A', G"fe„,7^) i- G"((7,„,7^) + g.A" 
{Refine summary statistics for fower and upper bounds.} 
G'(Q,7^) ^ min{G'(Q,7^),G'{g,„,7^)} 
G"(Q, 7^) ^ max{G"(Q, 7^), G"((?,„ , 7^)} 

{Compute the contributions from the accumulated local moments.} 
G{qt^,TZr{qi„J) ^ EvalLocalExpansion((5) 

{Sum the contribution from the local moments (direct or translated), the far- field evaluations, 
and exhaustive evaluations.} 

G(g^„,7^) ^ G(g,,„,7^I,(g,„J U 7^r(g^„J) + G(g^„,7^^(g^,„)) + G(g,„,7^£) 
A'(Q)^0, A"(Q)^0, Q.L^O 
else 

TransLocalToLocal((3, Q^), TransLocalToLocal((3, Q-'^) 

Q^.A' ^ Q^.A' + Q.A', Q«.A' ^ Q^.A' -|- Q.A' 

Q^.A" 4- Q^.A" + Q.A", Q^.A" ^ Q'^.A" + Q.A" 

Q.i ^ 0, Q.A' ^ 0, Q.A" 

DFGTPost(Q^), DFGTPost(Q^) 

{Refine the bounds based on the results of the recursion.} 

G' (Q, 7^) ^ min{G' (Q^ , 7^), G' (Q«, 7e)} 

G"(Q,7^) ^max{G"(Q^,7^),G"(Q^,7^)} 



the upper bound is set to the maximum of the "best" upper bound owned by the children of Q: 
max{G''(Q^,7^) -f Q^.A",G''(Q^,7^) + Q^.A"}. 

The function call DFGTPOST: We again induct on the number of points owned by the query 
node Q passed in as the argument to this function. If the query node Q is a leaf node, each query 
point qi^ E Q incorporates the passed-down bound changes Q.A' and Q.A". The bounds G\Q,TV) 
and G"(Q,7?.) are (correctly) set to the minimum among {qi^,TZ) and the maximum among 
G"(g^„,7^). If Q is not a leaf node: we know the sub-calls DFGTPost(Q^) and DFGTPost(Q^) 
maintains correct lower and upper bounds by the induction hypothesis since Q^ and contain a 
smaller number of points. Setting the lower and upper bounds for Q by the operations: G'(Q, TZ) -s— 
min{G'(Q^,7^),G'(Q^,7^)}, G''{Q,n) ^ max{G"(Q^, 7^), G"(Q^, 7^)} is valid. □ 

Theorem 3.9. After calling DFGTPoST (Algorithm {1^ in DFGTMain (Algorithm 0), each 
query point qi E Q accounts for every reference point rj E TZ in its Gaussian kernel sum approxi- 
mation G{qi,TZ). 

Proof. In Algorithm 1131 for each qi G Q, each rj G 7?. is cither accounted by an exhaustive compu- 
tation in DFGTBase or a prune in Summarize. All exhaustive computations for qi E Q directly 
update G{qi,Tl.£{qi)), while any pruned contributions will be incorporated into each G{qi,TZi{qi)) 
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(hence into G{qi,'R-{qi))) and when they are pushed down (to the leaf node to which Qi belongs) 
during the DFGT recursion or DFGTPosT. □ 



Theorem 3.10. For each query point qi G Q, the approximated kernel sum G{qi,TZ) satisfies the 
global relative error tolerance e. 

Proof. For simplicity, let us limit the available approximation methods to A e {E,T{c, 1)} where 
E denotes the exhaustive computation and T{c,l) denotes the centroid-based approximation about 
c. 

Given qi € Q, let Q' be the (unique) leaf node that owns q^. Let {^t„}^i denote the set 
of reference nodes whose kernel sum contribution were accoimted via centroid approximation and 
{REi}^J^i the set of reference nodes whose kernel sum contribution were computed exhaustively. 



Then it is clear that TZ 



a=l 



N,, 



b=l 



U ^tJ u U Re, with i?T„, n Rt^„ = 0, Re,, n Re,„ 



Rtj n Re,, = for 1 < a', a" < Na and 1 < b' ,b" < Nb- Let Qt^ be the query node that owns qi 

and is considered with the reference node Rt^ and pruned. Let G'^°'((5t„,7^) be a "snapshot" of 
the running lower bound on the kernel sum for query points owned by Qto, at the time the query 
node Qto, and the reference node -Rt„ were considered (and subsequently pruned). By the triangle 
inequality: 

|Gfe,7^)-Gfe,7^)| 



G L,, i [J BtA ^ i \J Re, 



G (qi, {{Rt^,T{Q.c, 1))}) - Gfe, RtJ 



J2G{qi,{iREb,E)}) -G(qi,RE,) 



iVa I . 

< E r {(^r^ , TiQ.c, 1))}) - Gfe, RrJ I + 1^ |g (</i, {{Re, , E)}) - G{qi, Re,)\ 

6=1 



a=l 

< y ' I I max 



o=l 

Na, 



\K,,id^(QTa,RTj) - K,,{\\QTa-C-RTa4\ 
I Khid' (QTa ,RtJ)- K^iWQTa-C- RTa 4 | 



Nt 



E%^G'W(Q,„,7^) + ^^G'W(Q',7^) 



Nb 



\REb\^, 



The proof can be easily extended to the case with four available approximation methods A G 
{E,T{c,p),F{c,p),D{c,p)}. □ 
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Figure 14: Empirical comparison of six different algorithms on different magnitudes of bandwidths 
on three different datasets. Each entry in the table has a timing number (if finite), oo symbol (if 
no parameter tweaking could achieve the error tolerance) , X symbol (if the algorithm segfaulted) . 

4 Experimental Results 

We evaluated empirical performance of six algorithms: 

• Naive: the brute- force algorithm (Algorithm [1} . 

• EFT: Fast fourier transform based kernel density estimate [18]. 

• FGT: Fast Gauss transform [TT] . 

• IFGT: improved fast Gauss transform [TOl [T5] . 

• DFD: the dual-tree centroid-based approximation method (TKH]- 

• DFGT: our new algorithm (Algorithm |8]) . 
We used the following six real-world datasets: 

• sj2-50000-2: two-dimensional astronomy position dataset. 

• colorsSOk: two-dimensional astronomy color dataset. 

• bio5: five-dimensional pharmaceutical dataset. 
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Figure 15: Empirical comparison of three algorithms on different magnitudes of bandwidths on 
three larger datasets. All timings are reported in seconds. 



• edgsc-radec: two-dimensional astronomy angle dataset. 

• mockgalaxy-D- 1 M: three-dimensional astronomy position dataset. 

• psfl-psf4-stargal-2d-only: two-dimensional astronomy dataset. 

Note that the last three datasets contain over 1 million points and demonstrate the scalability 
of our fast algorithm. For each dataset, we evaluated the empirical performance on computing 
kernel density estimates at seven different bandwidths ranging from 10^"^ to 10'^ times the optimal 
bandwidths according to the standard least-squares cross-validation score [15]. We measured the 
time required for computing KDE estimates that guarantee the global relative error criterion: 

G{qi,TZ) — G{qi,TZ) < eG{qi,TZ). We used e = 0.01. Each entry in the table has a timing number 

(if finite), oo symbol (if no parameter tweaking could achieve the error tolerance), X symbol (if the 
algorithm segfaulted; this is common in grid-based algorithms in higher dimension). The entries 
under S symbol denote the total time for least-squares cross-validation. Note that the FGT ensures: 

G{qi,TZ) — G{qi,TZ) < T. Therefore, we first set r = e, halving r imtil the error tolerance e was 

met; the time for verifying the global error guarantee (which includes comparison against the naively 
computed results) was not included in the timing. For the FFT, we started with 16 grid points 
along each dimension, and doubled the number of grid points until the error guarantee was met. 
For the IFGT, we took the most recent version of the algorithm that does automatic parameter 
tuning described in |13) . Our algorithms based on dual-tree methods guarantees the error bound 
automatically via a direct parameter e. 

The naive timings for the last datasets have been extrapolated from the performances on the 
smaller datasets. Our results demonstrate that our new algorithm can be as 15 times as fast as 
the original dual-tree algorithm. As expected, the grid-based original fast Gauss transform and the 
fast Fourier transformed based method fails in dimensions above two. 
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Figure 16: Top: It is conceptually easy to visualize the moments to be stored in a multi-dimensional 
array conceptually. Each dimension iterates over pmax scalars, giving a total count of (pmax) 
scalars. Bottom: The linear layout for the storing the coefficients. 

5 Conclusion 

In this paper, we combined the two methods: the dual-tree KDE [8 and the original fast Gauss 
transform [TT] to form the hierarchical form of the fast Gauss transform, the Dual-tree Fast Gauss 
Transform. Our results demonstrate that the 0{p^) expansion helps reduce the computational 
time on datasets of dimensionality up to 5. 

Appendix: Implementing the Gaussian Series-expansion 

This section explains how to implement the series-expansion mechanisms in computer languages 
such as C/C++. 

Storing the far-field/local moments as a linear array. Although the moments are inherently 
multi-dimensional, we store all coefficients in a C-style one-dimensional array. Each query node 
stores {pmax)^ local moment terms. Similarly, each reference node stores {pmax)^ far-field moment 
terms. These are allocated as a linear array during the construction of the two trees, as shown 
in Figure [THl which implies a bijective mapping between D-digit r&Aiyi-pmax numbers and decimal 
numbers between and p^ax ~ 1 inclusive. 

Converting betvireen a position and a multi-index in the Unear array. Algorithm [T51 shows 
the mapping from a position in the linear array of {pmax)^ terms to its corresponding multi- index. 
The algorithm converts the given position (given in base 10) to a number in base p. Algorithm 1171 
converts the given multi- index to its corresponding position in the linear array of length (pmax) ■ 
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Algorithm 16 PosiTiONToMuLTiiNDEX(i,p): Converts the position of a linear array of length 
to its multi-index, 
{i-th position maps to the multi-index a.} 

for d = D to d = 1 do 

a[d-{D-l)] <- 



return 



mod p 

a 



It is basically an algorithm to convert a ladbc-pmax number to its decimal representation. 

Algorithm 17 MuLTilNDExToPosiTiON(a): Converts the given multi-index to its corresponding 
position in the linear array of length [pmax]^ ■ 
{Converted position from the multi-index.} 

for d = D to d^ \ do 

X ^ X + f ■ a[d\ 

f ^ / ' Pmax 

return x 



Computing a multi-index expansion of a vector. A multi-index expansion of a vector 
x € up to p^ terms is basically the set of coefficients {x°'}a<p- See Figure [T71 This is 
used in the process of forming a far-field moment contribution of a single reference point in 
AccumulateFarFieldMoment and evaluating a local expansion in EvalLocalExpansion. 
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Figure 17: The multi-index expansion of a 2-D vector x = a;[2]]"^ up to 16 terms. 



Implementing the far- field moment accumulation (Equation ((20|) V This is straightforward 
given the implementation of the function 

MultiIndexExpansion. Basically, it computes the multi-index of each reference point in the given 
reference node and accumulates each contribution and normalizes the sum. See Algorithm [T9l 

Implementing the far-to-far translation operator (shown in Algorithm 1^01) . This consists of 
a doubly-nested for-loop over accumulated far-field moments. 

Computing the multivariate Hermite functions. We exploit the fact that the multivariate 
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Algorithm 18 MuLTilNDExExPANSiON(a;,p, M'): Computes M' = {a;"}^^, 
M'[0] ^ 1 

for each i = to i = — 1 do 

{Retrieve the multi-index mapping of the current position.} 

a POSITIONToMULTIINDEX(j,p) 

j the first index of a such that a[j] > 1. 
{Found a direct ancestor of the multiindex map a.} 

a' <~ a, ct'lj] ce'[j] — 1 

{Recursively compute the a-th multi-index component based on a'-th.} 

M'[{\ ^ Af'[MULTllNDEXToPOSITION(a')] • x[j] 



Algorithm 19 AccuMULATEFARFiELDMOMENT(i?): Implements Equation (|20l) . 
{Temporary space that is equal in size to {Ma{R^ R-c)}a<p,nax-} 

1=0, ■■■ ,(Pmoi) — 1 

for each rj^ G R do 

{Add M' = onto {M„(i?,i?.c)}„<p_.} 

MULTllNDEXEXPANSION (j'^'^^'" , p„,ax , M'^ 

{M4R,R.c)}o,<p^^^ ^ {Ma{R,R.c)}a<p^^^+M' 
for z = to z = {pmax)^ - 1 do 
M^{R,R.c) ^ M^{R,R.c) ■ ^ 



Hermite functions is a product of D univariate Hermite functions. Algorithm 1211 computes partial 
derivatives of the Gaussian kernel evaluated at the given point x along each dimension up to p-th. 

D 

order. ha{x) — Y[ ^a[d](a;) is a simple product of the univariate functions (see Algorithm [22]) . 

d=l 

Evaluating a far-field expansion. Once the functions for computing the Hermite functions 
(Algorithni[2T]and Algorithm l22p . we can implement the function for evaluating a far- field expansion 
up to p^ terms, as shown in Algorithm 1231 The basic structure is one outer-loop over each query 
point and the inner loop iterating over each far-field moment. The contribution to each query point 
is computed as a dot-product between the far-field moment and the computed Hermite functions 
(see Figured]). 

Implementing the far-to-local translation operator. The basic structure of the algorithm 
is a doubly nested for-loop, each over the coefficients. The doubly-nested for-loop first translate a 
portion of the accumulated far-field moments of R up to p^ terms into the local moments. The 
final step of the algorithm is to add the translated moments {i^({(i?, T((3.c,p))})} to the local 
moments stored in Q, Lfs{Q.c,TZT>{Q) U TZr{Q)). See Algorithm [24l 

Implementing the direct local accumulation operation. The basic structure is a doubly- 
nested for-loop, the outer-loop over the reference points whose moments are to be accumulated as 
local moments and the inner loop over the coefficient positions. See Algorithm 1251 
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Algorithm 20 TRANSFARToFAR(i?', i?): Implements Equation (1^ . 
{Allocate space for and compute | ( ^^^2/1^'^ ) } 

MULTIINDEXEXPANSION ( ^'v2^f ^ , Pmaa , C*^ 

for Z = to i < (pmaa;)^ do 

7 POSITIONToMULTIINDEX(i,Pmtt^) 
for j = to j < {pmax)° do 

a •(— POSITIONToMULTIINDEX(j,Pmoj;) 

if a < 7 then 

M^(i?, i?.c) ^ M^(i?, i?.c)+ 

^-^M„(i?',i?'.c) • C[MultiIndexToPosition(7 - a)] 



Algorithm 21 COMPUTEPARTiALDERiVATiVES(a,p, iJ): Evaluates the partial derivatives of 
g-£c /(2ft, ) _ ly^Yi order at each coordinate of a. 

for d = 1 to £) do 

H[d][0] ^ e-(«H)' 

if p > 1 then 

if p > 2 then 

for A; = ltoA:=p — 2 do 

H[d][k + 1] ^ 2 • a[d] ■ H[d][k] -2-k- H[d][k - 1] 



Implementing the local-to-local translation operator. We direct readers' attention to the 
first step of the algorithm, which retrieves the maximum order among used in local moment ac- 
cumulation/translation. Then the algorithm proceeds with a doubly- nested for- loop over the local 
moments applies Equation (l?71) . See Algorithm 

Evaluating the local expansion of the given query node. This function (see Algorithm [27| 
is consisted of one outer-loop over reference points and the inner-loop over the local moments up 
to terms, where p is the maximum approximation order used among the reference nodes pruned 
via far-to- local and direct local accumulations for Q. 
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Algorithm 22 CoMPUTEHERMlTEFuNCTlON(if, a): Computes the Hermite function ha{-) using 
the pre-computed partial derivatives H. 

for d = 1 to i) do 

f ^ f ■ H[c[\[a[d]] 
return / 



Algorithm 23 EvALFARFiELDExPANSiON(i?, Q,p): Evaluates the far-field expansion of the given 
reference node R up to terms. 

{Allocate space for holding the partial derivatives.} 

Hd=i,-,D ^0 

fe=0,--- 

for each qi^^ G Q do 

{Compute partial derivatives up to {p — l)-th order along each dimension.} 

COMPUTEPARTIALDERIVATIVES JJ^ 

to 

for i = to i = — 1 do 

a POSITIONToMULTIINDEX(i,p) 

/ ^ COMPUTEHERMITEFUNCTION(iJ, a) 

G{qi^,^l^{q^J)^G{q^^,n^{q^J)+W 
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Algorithm 24 TRANSFARToLoCAL(i?, (3,_p): Implements Equation ((^ . 



H d=i,-,D ^0 

fe=0,--- ,2(p-l) 

ComputePartialDerivatives ^S^^=^, 2p - 1, iJ^ 
for i = to I = — 1 do 

P ^ POSITIONToMULTIINDEX(i,p) 

for J = to J = — 1 do 

a <S— POSITIONToMULTIINDEX(j,p) 

/ -S— COMPUTEHERMITEFUNCTION(i/, a + (3) 

Lp{{{R,T{Q.c,p))}) ^ L0i{{R,T{Q.c,p))}) + M^{R,R.c) ■ f 
Lp{{{R,T{Q.c,pm ^ (^i^({(i?,r(Q.c,p))}) 
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/c=0,--- ,p-l 

for each rj^ E R do 

ComputePartialDerivatives 
for i = to — 1 do 

a POSITIONToMULTIINDEX(i,p) 

/ -S— COMPUTEHERMITEFUNCTION(_ff, ^) 

Lpi{iR,DiQ.c,p))}) ^ Lp{{{R,D{Q.c,pm + f 
{Lpi{{R,D{Q.c,p))})}f!<p ^ {Lp{{{R,D{Q.c,p))})}f,<p * 
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Algorithm 26 TransLocalToLocal((5', Q): Implements Equation (l?7l) . 

{p is the maximum approximation order used among the reference nodes pruned via far-to-local 
and direct local accumulations for Q' .} 

p -i— max < max pu , max pT > 

lReTZT,{Q') ReK-riQ') } 

{Temporary space that is equal in size to {Lp}.} 
X ^0 

MultiIndexExpansion 
for j = to p^ — 1 do 

a ^ POSITIONToMULTIINDEX(j,p) 

for fc = to - 1 do 

13 ^ POSITIONToMULTIINDEX(fc,p) 

if /3 > a then 

LpiQ.c, -RviQ') U nriQ')) ^ Lp{Q.c, TZviQ') U 7^r (Q'))+ 

^^(^L^(Q'.c,7^I,(g') u 7^r(Q'))^0-a 

{LfsiQ.cTZviQ) U 7^r(Q))};3<p ^ {Lp{Q.c,nv{Q) U 7^r(Q))}/3<p + {Lp{Q.c,nv{Q') U 
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Algorithm 27 EvalLocalExpansion(Q): Evaluates the accumulated local expansion of the 
given query node Q. 

{p is the maximum approximation order used among the reference nodes pruned via far-to-local 
and direct local accumulations for Q.} 

p max < max p d , max pT > 

{Temporary space to hold the multi-index expansion of each ^^i^^i:^ .| 
Xi^o, -- ,p°-i ^ 

for each S Q do 

{Compute the multi-index expansion of ^'^"^'^'^ up to p^ terms.} 

MULTllNDEXEXPANSION ( j^'^^f^'" iP, 

for z = to i = p^ — 1 do 

/3 <r- POSITIONToMULTIINDEX(i,p) 

^Z'^ z + Li3{Q.c,nv{Q)U UriQ)) ■ z 

G{qi^,nv{qiJ U TZriqiJ) ^ G{qi^,'Rv{qiJ U UriqiJ) + z 
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